Abstract

Passive  sensing  of  human  hand  and  limb  motion  is  important  for  a  wide  range  of 
applications  from  human-computer  interaction  to  athletic  performance  measurement. 
High  degree  of  freedom  articulated  mechanisms  like  the  human  hand  are  difficult  to 
track  because  of  their  large  state  space  and  complex  image  appearance.  This  article 
describes a model-based hand tracking system, called DigitEyes, that can recover the
state  of  a  27  DOF  hand  model  from  gray  scale  images  at  speeds  of  up  to  10  Hz.  We 
employ  kinematic  and  geometric  hand  models,  along  with  a  high  temporal  sampling 
rate,  to  decompose  global  image  patterns  into  incremental,  local  motions  of  simple 
shapes.  Hand  pose  and  joint  angles  are  estimated  from  line  and  point  features  extracted 
from  images  of  unmarked,  unadorned  hands,  taken  from  one  or  more  viewpoints.  We 
present some preliminary results on a 3D mouse interface based on the DigitEyes
sensor.

1 Introduction


Sensing of human hand and limb motion is important in applications from Human-Computer
Interaction (HCI) to athletic performance measurement. Current commercially available
solutions are invasive, and require the user to don gloves [ID] or wear targets [10]. This paper
describes a noninvasive visual hand tracking system, called DigitEyes. We have demonstrated
hand tracking at speeds of up to 10 Hz using line and point features extracted from gray
scale images of unadorned, unmarked hands.

Most previous real-time visual 3D tracking work has addressed objects with 6 or 7 spatial
degrees of freedom (DOF) [7, 9]. We present tracking results for branched kinematic chains
with  as  many  as  27  DOF  (in  the  case  of  a  human  hand  model).  We  show  that  simple,  useful 
features  can  be  extracted  from  natural  images  of  the  human  hand.  While  difficult  problems 
still  remain  in  tracking  through  occlusions  and  across  complicated  backgrounds,  these  results 
demonstrate  the  potential  of  vision-based  human  motion  sensing. 

This  paper  has  two  parts.  First,  we  describe  the  3D  visual  tracking  problem  for  objects 
with  kinematic  chains.  Second,  we  show  experimental  results  of  tracking  a  27  DOF  hand 
model  using  two  cameras,  and  describe  a  simple  3D  mouse  interface  using  a  single  camera. 

2  The  Articulated  Mechanism  Tracking  Problem 

Visual tracking is a sequential estimation problem: given an image sequence, recover the
time-varying state of the world [7, 9, 18]. The solution has three basic components: state
model, feature measurement, and state estimation. The state model specifies a mapping
from a state space, which characterizes all possible spatial configurations of the mechanism,
to a feature space. For the hand, the state space encodes the pose of the palm (seven states
for quaternion rotation and translation) and the joint angles of the fingers (four states per
finger, five for the thumb), and is mapped to a set of image lines and points by the state
model. A state estimate is calculated for each image by inverting the model to obtain the
state vector that best fits the measured features. Features for the unmarked hand consist of
finger link and tip occluding edges, which are extracted by local image operators.
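The state count can be checked with a short sketch. The class name `HandState`, the field names, and the ordering of the entries are assumptions made for illustration; the text fixes only the counts (seven palm states, four per finger, five for the thumb).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HandState:
    """Hypothetical container for the hand state; the entry order is an assumption."""
    palm_quaternion: List[float] = field(default_factory=lambda: [1.0, 0.0, 0.0, 0.0])
    palm_translation: List[float] = field(default_factory=lambda: [0.0, 0.0, 0.0])
    # Four fingers with four joint angles each (one abduction, three in-plane).
    finger_angles: List[List[float]] = field(
        default_factory=lambda: [[0.0] * 4 for _ in range(4)])
    # Five joint angles for the thumb.
    thumb_angles: List[float] = field(default_factory=lambda: [0.0] * 5)

    def to_vector(self) -> List[float]:
        """Flatten all states into a single list."""
        flat = list(self.palm_quaternion) + list(self.palm_translation)
        for angles in self.finger_angles:
            flat.extend(angles)
        flat.extend(self.thumb_angles)
        return flat


state = HandState()
num_states = len(state.to_vector())  # 7 + 4 * 4 + 5
num_dof = num_states - 1             # the unit quaternion carries one redundant parameter
```

The difference between the two counts comes from the quaternion: four numbers encode three rotational DOF.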

Articulated mechanisms are more difficult to track than a single rigid object for two
reasons: their state space is larger and their appearance is more complicated. First, the state
space must represent additional kinematic DOFs not present in the single-object case, and
the resulting estimation problem is more expensive computationally. In addition, kinematic
singularities are introduced that are not present in the six DOF case. Singularities arise
when a small change in a given state has no effect on the image features. They are currently
dealt with by stabilizing the estimation algorithm. Second, high DOF mechanisms produce
complex image patterns as their DOFs are exercised. This is illustrated in Fig. 1, where
changes in the pose of a model hand are shown to yield dramatic changes in its silhouette.
People exploit this observation in making shapes from shadows cast by their hands.

To reduce the complexity of the hand motion, we employ a high image acquisition rate
(10-15 Hz depending on the model) which limits the change in the hand state, and therefore
image feature location, between frames. As a result, state estimation and feature
measurement are local, rather than global, search problems. In the state space, we exploit this
locality by linearizing the nonlinear state model around the previous estimate. The resulting
linear estimation problem produces state corrections which are integrated over time to yield
an estimated state trajectory. In the image, the projection of the previous estimate through
the state model yields coordinate frames for feature extraction. We currently assume that
the closest available feature is the correct match, which limits our system to scenes without
occlusions or complicated backgrounds.

Figure 1: Changes in the hand state yield significant changes in appearance, as these four
configurations of the model hand illustrate. Views (a) and (b) differ only in the pose of the
hand, as do (c) and (d); while views (a) and (c) differ only in the values of the finger joint
angles. Finger links are modeled with cylinders, and finger tips with hemispheres.

Previous work on tracking general articulated objects includes [18, 12, 11]. In [18],
Yamamoto and Koshikawa describe a system for human body tracking using kinematic and
geometric models. They give an example of tracking a single human arm and torso using
optical flow features. Pentland and Horowitz [12] give an example of tracking the motion
of a human figure using optical flow and an articulated deformable model. In [6], Dorner
describes a system for interpreting American Sign Language from image sequences of a single
hand. Dorner's system uses the full set of the hand's DOFs, and employs a glove with colored
markers to simplify feature extraction. A much earlier system by O'Rourke and Badler [11]
analyzed human body motion using constraint propagation. In other hand-specific work,
Kang and Ikeuchi describe a range sensor-based approach to hand pose estimation [8], used
in their Assembly Plan from Observation system.

Two recent works [14, 4] have addressed pose estimation of articulated objects from a
single view. Dhome et al. recover the pose of an industrial robot arm from a single image
and a CAD model [4]. They use a kinematic representation that decouples rotation and
translation to allow for more efficient global search of the state space. In [14], Shakunaga
derives constraints on joint angles from point and line measurements and gives an algorithm
for pose recovery.

In addition to this work on articulated object tracking, several authors have applied
general motion techniques to human motion analysis. In contrast to DigitEyes, these approaches
analyze a subset of the total hand motion, such as a set of gestures [2] or the rigid motion of
the palm [1]. Darrell and Pentland describe a system for learning and recognizing dynamic
hand gestures in [2]. Their approach avoids the problems of hand modeling, but doesn't
address 3D tracking. In [1], Blake et al. describe a real-time contour tracking system that
can follow the silhouette of a rigid hand under an affine motion model.

None of these earlier approaches have demonstrated tracking results for the full state of a
complicated mechanism like the human hand, using natural image features. Although there
has been a significant amount of gesture recognition work on unmarked hand images, these
approaches don't produce 3D motion estimates, and it would be difficult to apply them to
problems like the 3D mouse interface described in Subsect. 6.1. See [16] for several other
examples of novel user interfaces based on a whole-hand sensor.

In order to apply the DigitEyes system to specific applications, such as HCI, two practical
requirements must be met. First, the kinematics and geometry of the target hand must be
known in advance, so that a state model can be constructed. Second, before local hand
tracking can begin, the initial configuration of the hand must be known. We achieve this in
practice by requiring the subject to place their hand in a certain pose and location to initiate
tracking. A 3D mouse interface based on visual hand tracking is presented in Subsect. 6.1.

In the sections that follow, we describe the DigitEyes articulated object tracking system
in  more  detail,  along  with  the  specific  modeling  choices  required  for  hand  tracking. 

3  State  Model  for  Articulated  Mechanisms 

The state model encodes all possible mechanism configurations and their corresponding
image feature patterns as a two-part mapping between state and feature spaces. The first
part is a kinematic model which captures all possible spatial link positions, while the second
part is a feature model which describes the image appearance of each link shape.

3.1  Kinematic  Model:  Application  to  the  Human  Hand 

We model kinematic chains, like the finger, with the Denavit-Hartenberg (DH)
representation, which is widely used in robotics [15]. In this representation, each finger link has an
attached link coordinate frame, and the transformations between these frames model the
kinematics. Since feature models require geometric information not captured in the
kinematics, the DH description of each link is augmented with an additional transform from the
link frame to a shape frame. A solid model in the shape frame generates features through
projection into the image.

We model the hand as a collection of 16 rigid bodies: 3 individual finger links (called
phalanges) for each of the five digits, and a palm. From a kinematic viewpoint, the hand
consists of multi-branched kinematic chains attached to a six DOF base. We make several
simplifying assumptions in modeling the hand kinematics. First, we assume that each of
the four fingers of the hand is a planar mechanism with four degrees of freedom (DOF).
The abduction DOF moves the plane of the finger relative to the palm, while the remaining
3 DOF determine the finger's configuration within the plane. Fig. 2 illustrates the planar
finger model. Each finger has an anchor point, which is the position of its base joint center
in the frame of the palm, which is assumed to be rigid. The base joint is the one farthest
(kinematically) from the finger tip. We use a four parameter quaternion representation of the
palm pose, which eliminates rotational singularities at the cost of a redundant parameter.
The total hand pose is described by a 28 dimensional state vector.

Figure 2: Kinematic models, illustrated for fourth finger and thumb. The arrows illustrate
the joint axes for each link in the chain.

The 3D shape of the hand is determined by the shape of its links and palm. These shapes
can be given by solid models, or a class of deformable models as in [12]. Shape models are
described with respect to the shape frame, which is positioned relative to the link coordinate
frame. In general, the DH transform between two links is a series of four transforms

$$T_i^{i+1} = \mathrm{Rot}_{z,\theta} \, \mathrm{Trans}_{z,d} \, \mathrm{Trans}_{x,a} \, \mathrm{Rot}_{x,\alpha} . \tag{1}$$

In our framework, the shape frame is located after the first transform, and so the kinematic
to shape frame transform is just $\mathrm{Rot}_{z,\theta}$.
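As a minimal sketch of the four-transform DH product, the following composes homogeneous 4x4 matrices in pure Python. The function names and the parameter names (`theta`, `d`, `a`, `alpha`) follow the standard DH convention and are assumptions here, not the authors' implementation.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 4x4 homogeneous transforms."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def rot_z(theta: float) -> Matrix:
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0, 0.0], [s, c, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]


def rot_x(alpha: float) -> Matrix:
    c, s = math.cos(alpha), math.sin(alpha)
    return [[1.0, 0.0, 0.0, 0.0], [0.0, c, -s, 0.0],
            [0.0, s, c, 0.0], [0.0, 0.0, 0.0, 1.0]]


def trans(x: float, y: float, z: float) -> Matrix:
    return [[1.0, 0.0, 0.0, x], [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z], [0.0, 0.0, 0.0, 1.0]]


def dh_transform(theta: float, d: float, a: float, alpha: float) -> Matrix:
    """Rot(z, theta) * Trans(z, d) * Trans(x, a) * Rot(x, alpha)."""
    result = rot_z(theta)
    for step in (trans(0.0, 0.0, d), trans(a, 0.0, 0.0), rot_x(alpha)):
        result = matmul(result, step)
    return result


def shape_frame_transform(theta: float) -> Matrix:
    """Link frame to shape frame: only the first of the four transforms."""
    return rot_z(theta)


# A joint rotated by 90 degrees with a link of unit length along x.
link = dh_transform(math.pi / 2, 0.0, 1.0, 0.0)
```

For a planar finger joint, `d` and `alpha` would be zero and `a` the link length, so the origin of the next frame lands at the rotated end of the link.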

The thumb is the most difficult digit to model, due to its great dexterity and intricate
kinematics. We currently employ the thumb model used in Rijpkema and Girard's grasp
modeling system [13] (see Fig. 2). They were able to obtain realistic animations of human
grasps using a five DOF model. The DH parameters for the first author's right hand, used
in the experiments, can be found in Table 1.

Real fingers deviate from our modeling assumptions in three ways. First, most fingers
deviate slightly from planarity. This deviation could be modeled with additional kinematic
transforms, but we have found the planar approximation to be adequate in practice. Second,
the last two joints of the finger, counting from the palm outwards, are driven by the same
tendon and are not capable of independent actuation. It is simpler to model the DOF
explicitly, however, than to model the complicated angular relationship between the two
joints. The third and most significant modeling error is change in the anchor points during
motion. We have modeled the palm as a rigid body, but in reality it can flex. In gripping
a baseball, for example, the palm will conform to its surface, causing the anchor points to
deviate from their rest position by tens of millimeters. Fortunately, for free motions of the
hand in space, the deviation seems to be small enough to be tolerated by our system.

The modeling framework we employ is general. To track an arbitrary articulated
structure, one simply needs its DH parameters and a set of shape models that describe its visual
appearance. Within the subproblem of hand tracking, this allows us to develop a suite of
hand models whose DOFs are tailored to specific applications.

Table 1: Kinematic and shape parameters for the first finger and thumb of the first author's
right hand, which are used in the experiments. State variables are denoted $q_i$, where four
of them contain the quaternion for palm rotation and three contain palm translation. The "Next"
field gives the number of the next frame in the kinematic chain. The other three fingers are
similar to the first.

3.2  Feature  Model:  Description  of  Hand  Images 

The output of the hand state model is a set of features consisting of lines and points generated
by the projection of the hand model into the image plane. Each finger link, modeled by a
cylinder, generates a pair of lines in the image corresponding to its occlusion boundaries.
The bisector of these lines, which contains the projection of the cylinder central axis, is used
as the link feature. The link feature vector $[a \; b \; \rho]$ gives the parameters of the line equation
$ax + by - \rho = 0$. Using the central axis line as the link feature eliminates the need to model
the cylinder radius or the slope of the pair of lines relative to the central axis, which is often
significant near the finger tips. We use the entire line because the endpoints are difficult to
measure in practice. Fig. 3 shows two link feature lines extracted from the first two links of
a finger.

Each finger tip, modeled by a hemisphere, generates a point feature by projection of the
center into the image. The finger tip feature vector $[x \; y]$ gives the tip position in image
coordinates, as illustrated in Fig. 3. The total hand appearance is described by a $(3m + 2n)$
dimensional vector, made up of link and tip features, where $m$ and $n$ are the number of
finger links and tips, respectively, in the model.

Figure 3: Features used in hand tracking are illustrated for finger links 1 and 2, and the tip.
Each infinite line feature is the projection of the finger link central axis.

Other  feature  choices  for  hand  tracking  are  possible,  but  the  occlusion  contours  are  the 
most  powerful  cue.  Hand  albedo  tends  to  be  uniform,  making  it  difficult  to  use  correlation 
features. Shading is potentially valuable, but the complicated illuminance and self-shadowing
of  the  hand  make  it  difficult  to  use. 

4 Feature Measurement: Detection of Finger Links and Tips

Local image-based trackers are used to measure hand features. These trackers are the
projections of the spatial hand geometry into the image plane, and they serve to localize and
simplify feature extraction. A finger link tracker, drawn as a "T"-shape, is depicted along
with its measured line feature in Fig. 4. The stem of the "T" is the projection of the cylinder
center axis into the image. The image sampling rate ensures that the true feature location
is near the projected tracker.

Once the link tracker has been positioned, line features are extracted by sampling the
image in slices perpendicular to the central axis. For each slice, the derivative of the 1D
image profile is computed. Peaks in the derivative with the correct sign correspond to the
intersection of the slice with the finger silhouette. The extracted intensity profile and peak
locations for a single slice are illustrated in Fig. 5. Line fitting to each set of two or more
detected intersections gives a measurement of the projected link axis. If only one silhouette
line is detected for a given link, the cylinder radius can be used to extrapolate the axis line
location. Currently, the length of the slices (search window) is fixed by hand. Finger tip
positions are measured through a similar procedure.
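The per-slice measurement can be sketched as follows. This is an illustration under stated assumptions rather than the system's actual code: the finger is taken to be brighter than the background, the threshold `min_step` and the helper names are hypothetical, and the axis crossing is taken per slice as the midpoint of the two edges instead of fitting lines across slices.

```python
from typing import List, Optional, Tuple


def slice_derivative(profile: List[float]) -> List[float]:
    """Forward difference of a 1D intensity profile sampled along one slice."""
    return [profile[i + 1] - profile[i] for i in range(len(profile) - 1)]


def find_silhouette_edges(profile: List[float],
                          min_step: float = 10.0
                          ) -> Tuple[Optional[float], Optional[float]]:
    """Return (left, right) edge positions along the slice, or None where no
    derivative peak of the expected sign exceeds min_step."""
    deriv = slice_derivative(profile)
    if not deriv:
        return None, None
    rising = max(range(len(deriv)), key=lambda i: deriv[i])
    falling = min(range(len(deriv)), key=lambda i: deriv[i])
    # A difference between samples i and i+1 is located halfway between them.
    left = rising + 0.5 if deriv[rising] >= min_step else None
    right = falling + 0.5 if deriv[falling] <= -min_step else None
    return left, right


def axis_crossing(left: Optional[float], right: Optional[float],
                  radius_px: Optional[float] = None) -> Optional[float]:
    """Where the link axis crosses the slice; falls back on the projected
    cylinder radius when only one silhouette edge was found."""
    if left is not None and right is not None:
        return 0.5 * (left + right)
    if radius_px is None:
        return None
    if left is not None:
        return left + radius_px
    if right is not None:
        return right - radius_px
    return None


both = find_silhouette_edges([20, 20, 20, 200, 200, 200, 200, 20, 20])
center = axis_crossing(*both)
one_sided = find_silhouette_edges([20, 20, 200, 200, 200])
extrapolated = axis_crossing(*one_sided, radius_px=2.0)
```

Only the pixels along each slice are touched, which is where the saving in pixel processing comes from.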

Using local trackers and sampling along lines in the image reduces the pixel processing
requirements of feature measurement, permitting fast tracking.

5  State  Estimation  for  Articulated  Mechanisms 


State estimation proceeds by making incremental state corrections between frames. One
cycle of the estimation algorithm goes as follows: The current state estimate is used to predict
feature locations in the next frame and position feature trackers. After image acquisition
and feature extraction, measured and predicted feature values are compared to produce a
state correction, which is added to the current estimate to obtain a new state estimate. The
difference between measured and predicted states is modeled by a residual vector, and the
state correction is obtained by minimizing its magnitude squared. A high image sampling
rate allows us to linearize the nonlinear mapping from state to features around an
operating point, which is recomputed at each frame, to obtain a linear least squares problem in
the model Jacobian. The following subsections describe the residual model and estimation
algorithm in detail.

5.1  Residual  Model:  Link  and  Tip  Image  Alignment 

The tip residual measures the Euclidean distance in the image between predicted ($c_i$) and
measured ($t_i$) tip positions. The residual for the $i$th tip feature is a vector in the image
plane defined by

$$v_i(q) = c_i(q) - t_i , \tag{2}$$

where $c_i$ is the projection of the tip center into the image as a function of the hand state.

The link residual is a scalar that measures the deviation of the projected cylinder axis
from the measured feature line. It is illustrated for a single finger link in Fig. 4. The
residual at a point along the axis equals the perpendicular distance to the feature line. We
incorporate the orthographic camera model into the residual equation by setting
$m = [a \; b \; 0]^T$ and writing

$$l_i(q) = m^T p_i(q) - \rho , \tag{3}$$

where $p_i(q)$ is the 3D position of a point on the cylinder link in camera coordinates, and
$[a \; b \; \rho]$ are the line feature parameters. The total link residual consists of one or more point
residuals along the cylinder axis (at the base and tip), each given by (3). Note that both
residuals are linear in the model point positions.

The feature residuals for each link and tip in the model are concatenated into a single
residual vector, $R(q)$. If the magnitude of the residual vector is zero, the hand model is
perfectly aligned with the image data.
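A small numerical sketch of the two residual types follows. It assumes an orthographic camera looking along z, a measured line $[a \; b \; \rho]$ whose normal $(a, b)$ has unit length, and made-up coordinates; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def tip_residual(predicted: Sequence[float],
                 measured: Sequence[float]) -> Tuple[float, float]:
    """Image-plane vector from the measured tip to the predicted tip."""
    return (predicted[0] - measured[0], predicted[1] - measured[1])


def link_point_residual(point_cam: Sequence[float],
                        line: Sequence[float]) -> float:
    """Signed perpendicular distance from a projected axis point to the line.

    point_cam is (x, y, z) in camera coordinates; m = (a, b, 0) drops the
    depth, which is how the orthographic projection enters the residual.
    """
    a, b, rho = line
    m = (a, b, 0.0)
    return sum(mi * pi for mi, pi in zip(m, point_cam)) - rho


def link_residual(axis_points: Sequence[Sequence[float]],
                  line: Sequence[float]) -> List[float]:
    """One scalar residual per sampled axis point (for example base and tip)."""
    return [link_point_residual(p, line) for p in axis_points]


measured_line = (1.0, 0.0, 2.0)                       # the image line x = 2
axis_points = [(3.0, 5.0, 40.0), (1.5, 9.0, 42.0)]    # base and tip of one link
residual_vector = link_residual(axis_points, measured_line)
residual_vector += list(tip_residual((4.0, 6.0), (3.0, 8.0)))
cost = 0.5 * sum(r * r for r in residual_vector)      # half the squared magnitude
```

Each link contributes scalar entries and each tip contributes two, so the concatenated vector here has four entries.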

5.2  Estimation  Algorithm:  Nonlinear  Least  Squares 

The state correction is obtained from the residual vector by minimizing

$$H(q) = \frac{1}{2} \, \| R(q) \|^2 . \tag{4}$$

We employ a modified Gauss-Newton (GN) algorithm to solve this nonlinear least squares
problem [3]. The source of nonlinearity in the state model for articulated mechanisms is
trigonometric terms in the forward kinematic model. The other source of nonlinearity,
inverse depth coefficients in the perspective camera model, is absent in our orthographic
formulation.

Figure 4: Image trackers, detected features, and residuals for a link and a tip are shown
using the image from Fig. 3. Dashed lines denote the link residual error between the
T-shaped tracker and its extracted line measurement. Similarly, the tip tracker (caret shape)
is connected to its point feature (cross) by a residual vector.

Let $R(q_j)$ be the residual vector for image $j$. The GN state update equation is given by

$$q_{j+1} = q_j - \left[ J_j^T J_j + S \right]^{-1} J_j^T R_j , \tag{5}$$

where $J_j$ is the Jacobian matrix for the residual $R_j$, both of which are evaluated at $q_j$. $S$
is a constant diagonal conditioning matrix used to stabilize the least squares solution. $J_j$
is formed from the link and tip residual Jacobians. The same basic approach was used by
Lowe in his rigid body tracking system [9].
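To make the stabilized update concrete, here is a toy sketch on a two-joint planar chain with a single tip measurement, iterating $q \leftarrow q - [J^T J + S]^{-1} J^T R$. The chain, its unit link lengths, the conditioning values, and the function names are all assumptions chosen to keep the example small; the 28-state hand problem has the same structure with larger matrices.

```python
import math
from typing import List, Tuple

LINK_1, LINK_2 = 1.0, 1.0  # hypothetical link lengths


def tip_position(q: Tuple[float, float]) -> Tuple[float, float]:
    """Forward kinematics of the toy chain (the source of the trig nonlinearity)."""
    a, b = q[0], q[0] + q[1]
    return (LINK_1 * math.cos(a) + LINK_2 * math.cos(b),
            LINK_1 * math.sin(a) + LINK_2 * math.sin(b))


def residual(q: Tuple[float, float],
             measured: Tuple[float, float]) -> Tuple[float, float]:
    c = tip_position(q)
    return (c[0] - measured[0], c[1] - measured[1])


def jacobian(q: Tuple[float, float]) -> List[List[float]]:
    """Derivative of the tip residual with respect to the two joint angles."""
    a, b = q[0], q[0] + q[1]
    return [[-LINK_1 * math.sin(a) - LINK_2 * math.sin(b), -LINK_2 * math.sin(b)],
            [LINK_1 * math.cos(a) + LINK_2 * math.cos(b), LINK_2 * math.cos(b)]]


def gauss_newton_step(q: Tuple[float, float], measured: Tuple[float, float],
                      s_diag: Tuple[float, float] = (1e-3, 1e-3)
                      ) -> Tuple[float, float]:
    """One stabilized update; the 2x2 system is solved in closed form."""
    r = residual(q, measured)
    j = jacobian(q)
    # A = J^T J + S (symmetric), g = J^T R
    a00 = j[0][0] ** 2 + j[1][0] ** 2 + s_diag[0]
    a01 = j[0][0] * j[0][1] + j[1][0] * j[1][1]
    a11 = j[0][1] ** 2 + j[1][1] ** 2 + s_diag[1]
    g0 = j[0][0] * r[0] + j[1][0] * r[1]
    g1 = j[0][1] * r[0] + j[1][1] * r[1]
    det = a00 * a11 - a01 * a01
    dq0 = (a11 * g0 - a01 * g1) / det
    dq1 = (a00 * g1 - a01 * g0) / det
    return (q[0] - dq0, q[1] - dq1)


true_state = (0.5, 0.7)
measured_tip = tip_position(true_state)
estimate = (0.4, 0.6)  # previous-frame estimate, close to the true state
for _ in range(20):
    estimate = gauss_newton_step(estimate, measured_tip)
final_error = math.hypot(*residual(estimate, measured_tip))
```

The positive diagonal of `S` keeps the matrix invertible even when a column of `J` vanishes, which is the situation at a kinematic singularity.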

Other tracking work has employed Kalman filtering to incorporate dynamic constraints
into state estimation [1, 7, 17, 5]. The update rule in (5) can be viewed as the limiting case
of this filter, in which the estimate is a function of the measurements alone. The complicated
dynamics of the hand and its ability to accelerate rapidly weaken the effectiveness of dynamic
constraints (compared, for example, to satellite tracking problems). Time smoothing may
be useful in some applications, but the kinematic hand model provides a much stronger
constraint on feature locations and potential matches.

In the remainder of this section, we derive the link and tip Jacobians and discuss their
computation. To calculate the link Jacobian we differentiate (3) with respect to the state
vector, obtaining

$$\frac{\partial l_i(q)}{\partial q} = m^T \frac{\partial p_i(q)}{\partial q} . \tag{6}$$

The above gradient vector for link $i$ is one row of the total Jacobian matrix. Geometrically,
it is formed by projecting the kinematic Jacobian for points on the link, $\partial p_i(q) / \partial q$, in the
direction of the feature edge normal. Similarly, the tip Jacobian is obtained as

$$\frac{\partial v_i(q)}{\partial q} = \frac{\partial p_i(q)}{\partial q} . \tag{7}$$

Figure 5: A single link tracker is shown along with its detected boundary points. One slice
through the image of a finger is also depicted. Peaks in the derivative give the edge
locations.


The kinematic Jacobians in (6) and (7) are composed of terms of the form $\partial p_i / \partial q_j$, which
arise frequently in robot control. As a result, these Jacobian entries can be obtained directly
from the model kinematics by means of some standard formulas (see [15], Chapter 5). There
are three types of Jacobians, corresponding to joint rotation, spatial translation, and spatial
rotation DOFs. All points must be expressed in the frame of the camera producing the
measurements. For a revolute (rotational) DOF joint $q_j$ we have

$$\frac{\partial p_i}{\partial q_j} = w_j \times (p_i - d_j) , \tag{8}$$

where $w_j$ is the rotation axis for joint $j$ expressed in the camera frame, and $d_j$ is the position
of the joint $j$ frame in camera coords. There will be a similar calculation for each camera
being used to produce measurements.
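The revolute-joint formula is a single cross product, and it can be checked against a finite difference. In this sketch the axis, joint position, and point are made-up values in camera coordinates, and the helper names are hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def revolute_jacobian_column(axis: Vec3, joint_origin: Vec3, point: Vec3) -> Vec3:
    """Velocity of a link point per unit joint rate: axis x (point - origin)."""
    offset = (point[0] - joint_origin[0],
              point[1] - joint_origin[1],
              point[2] - joint_origin[2])
    return cross(axis, offset)


def link_jacobian_entry(m: Vec3, column: Vec3) -> float:
    """Projection of the point velocity onto m = (a, b, 0), the line normal."""
    return sum(mi * ci for mi, ci in zip(m, column))


def rotate_about_z(point: Vec3, origin: Vec3, angle: float) -> Vec3:
    """Rotate a point about a z-directed axis through origin (for the check)."""
    dx, dy = point[0] - origin[0], point[1] - origin[1]
    c, s = math.cos(angle), math.sin(angle)
    return (origin[0] + c * dx - s * dy, origin[1] + s * dx + c * dy, point[2])


axis = (0.0, 0.0, 1.0)
joint = (1.0, 0.0, 5.0)
point = (2.0, 0.0, 5.0)
column = revolute_jacobian_column(axis, joint, point)

eps = 1e-6
moved = rotate_about_z(point, joint, eps)
numeric = tuple((after - before) / eps for after, before in zip(moved, point))
entry = link_jacobian_entry((0.0, 1.0, 0.0), column)
```

A joint whose axis passes through the point, or whose induced motion lies along the feature line, yields a zero entry, which is one way the singularities mentioned earlier show up.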

The Jacobian calculation for the palm DOFs must reflect the fact that palm motion
takes place with respect to the world coordinate frame, but must be expressed in the camera
frame. We obtain the translation component as

$$\frac{\partial p_i}{\partial v} = R_c^w , \tag{9}$$

where $v$ is the palm velocity with respect to the world frame and $R_c^w$ is the camera to world
rotation. Similarly, if $q_j$ is a component of the quaternion specifying palm rotation, we
obtain


$$\frac{\partial p_i}{\partial q_j} = \left[ R_c^w J_w \right]_j \times p_i , \tag{10}$$

where $J_w$ is a Jacobian mapping quaternion velocity to angular velocity, and $[\cdot]_j$ denotes the
$j$th column of a matrix.

The  details  of  the  derivation  are  contained  in  Appendix  A. 


5.3  Tracking  with  Multiple  Cameras 

The tracking framework presented above generalizes easily to more than one camera. When
multiple cameras are used, the residual vectors from each camera are concatenated to form
a single global residual vector. This formulation can exploit partial observations. If a
finger link is visible in one view but not in another due to occlusion, the single view
measurement is still incorporated into the residual, and therefore the estimate.
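The concatenation with partial observations can be sketched in a few lines. The dictionary layout, the feature names, and the use of `None` for an occluded feature are assumptions for illustration only.

```python
from typing import Dict, List, Optional


def global_residual(per_camera: List[Dict[str, Optional[float]]]) -> List[float]:
    """Stack the residuals of every camera, skipping features a camera did not see."""
    stacked: List[float] = []
    for camera in per_camera:
        for name in sorted(camera):      # fixed order, so Jacobian rows can match
            value = camera[name]
            if value is not None:        # None marks an occluded or missing feature
                stacked.append(value)
    return stacked


camera_0 = {"index_link_1": 0.4, "index_link_2": -0.1}
camera_1 = {"index_link_1": 0.3, "index_link_2": None}  # link 2 occluded in this view
stacked = global_residual([camera_0, camera_1])
```

The Jacobian rows would be stacked in the same order, each computed with points expressed in the frame of the camera that produced the measurement.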

6  Experimental  Results 

To test the articulated tracking framework described above, we developed two hand tracking
systems based on reduced and full-state hand models, using one and two cameras. The
reduced hand model was used with a single camera to provide input to a 3D cursor interface.
The full hand model was tracked using two image sequences. In both cases we provide
recorded state trajectory estimates along with graphical output.

6.1  3D  Graphical  Mouse  Using  a  Single  Camera 

For the first tracking experiment, we applied the DigitEyes system to a 3D mouse interface
problem. Figure 6 shows an example of a simple 3D graphical environment, consisting of
a ground plane, a 3D cursor (drawn as a pole, with the cursor at the top), and a spherical
object (for manipulation). Shadows generate additional depth cues. The interface problem
is to provide the user with control of the cursor's three DOFs, and thereby the means to
manipulate objects in the environment. In the standard "mouse pole" solution, the 3D cursor
position is controlled by clever use of a standard 2D physical mouse. Normal mouse motion
controls the pole base position in the plane, while depressing one of the mouse buttons
switches reference planes, causing mouse motion in one direction to control the pole (cursor)
height. By switching between planes, the user can place the cursor arbitrarily. Commanding
continuous motion with this interface is awkward, however, and tracing an arbitrary, smooth
space curve is nearly impossible.

In the DigitEyes solution to the 3D mouse problem, the 3 input DOFs are derived from
a partial hand model, which consists of the first and fourth fingers of the hand, along with
the thumb. The palm is constrained to lie in the plane of the table used in the interface, and
thus has 3 DOF. The first finger has 3 articulated DOFs, while the fourth finger and thumb
each have a single DOF allowing them to rotate in the plane of the table (abduct). The hand
model is illustrated in Fig. 7. A single camera oriented at approximately 45 degrees to the
table top acquires the images used in tracking. The palm position in the plane controls the
base position of the pole, while the height of the index finger above the table controls the
height of the cursor. This particular mapping has the important advantage of decoupling
the controlled DOFs, while making it possible to operate them simultaneously. For example,
the user can change the pole height while leaving the base position constant. The fourth
finger and thumb have abduction DOFs in the plane, and are used as "buttons".

Figures 8-10 give experimental results from a 500 frame motion sequence in which the
estimated hand state was used to drive the 3D mouse interface (implementation details are
given in Sec. 7). Figures 8 and 9 show the estimated hand state for each frame in the image
sequence. Frames were acquired at 100 ms sampling intervals. The pole height and base
position derived from the hand state by the 3D mouse interface are also depicted in Fig. 9.
The motion sequence has four phases. In the first phase (frame 0 to 150), the user's finger
is raised and lowered twice, producing two peaks in the pole height, with a small variation
in the estimated pole position. Second, around frame 150 the finger is raised again and kept
elevated, while the thumb is actuated, as for a "button event". The actuation period is
from frame 150 to frame 200, and results in some change in the pole height, but negligible
change in pole position. Third, from 200 to 350, the pole height is held constant while the
pole position is varied. Finally, from 350 to the end of the sequence all states are varied
simultaneously. Sample mouse pole positions throughout the sequence are illustrated in
Fig. 10 (at the end of the report). This is the same scene as in Fig. 6, except that the mouse
pole height and position change as a function of the estimated hand state. A hand image
from the middle of the sequence (frame 200) is shown in Fig. 7 along with the estimated
hand model state.

Figure 6: A sample graphical environment for a 3D mouse. The 3D cursor is at the tip of
the "mouse pole", which sits atop the ground plane (in the foreground, at the right). The
sphere is an example of an object to be manipulated, and the line drawn from the mouse to
the sphere indicates its selection for manipulation.

These  results  demonstrate  fairly  good  decoupling  between  the  desired  states  and  a  useful 
dynamic  range  of  motion.  The  largest  coupling  error  occurs  around  frame  150  when  the  pole 
height  drops  as  the  thumb  is  actuated.  This  coupling  could  be  compensated  for  by  storing 
a  list  of  estimated  pole  heights  and  restoring  the  height  to  its  previous  value  when  the  onset 
of  thumb  actuation  is  detected.  In  this  experiment,  the  mouse  state  is  generated  from  the 
hand state by a simple scaling and coordinate change. An unfortunate side effect of scaling
is to amplify the noise in the estimator. More sophisticated schemes based on smoothing
the  state  prior  to  its  use  would  likely  improve  the  output  quality. 
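
The compensation and smoothing ideas in the paragraph above can be sketched in a few lines of Python. This is an illustrative sketch only, not the DigitEyes implementation: the class name MousePoleMapper, the scale factor, the smoothing weight alpha, the thumb threshold press_angle, and the lookback length are all hypothetical, and exponential smoothing is just one assumed choice of smoother.

```python
from collections import deque


class MousePoleMapper:
    """Map hand state to mouse pole state (hypothetical sketch).

    Smooths the hand state before scaling, and holds the pole height at a
    stored earlier value while the thumb "button" is pressed.
    """

    def __init__(self, scale=2.0, alpha=0.5, press_angle=0.3, lookback=5):
        self.scale = scale              # assumed hand-to-scene scale factor
        self.alpha = alpha              # assumed smoothing weight in (0, 1]
        self.press_angle = press_angle  # assumed thumb angle (rad) meaning "pressed"
        self.recent_heights = deque(maxlen=lookback)  # heights seen before a press
        self.smoothed = None
        self.held_height = None

    def update(self, palm_x, palm_z, finger_height, thumb_angle):
        raw = (palm_x, palm_z, finger_height)
        if self.smoothed is None:
            self.smoothed = raw
        else:
            a = self.alpha
            # Exponential smoothing applied before scaling, so that the
            # scaling step amplifies less estimator noise.
            self.smoothed = tuple(
                a * r + (1.0 - a) * s for r, s in zip(raw, self.smoothed)
            )
        x, z, h = self.smoothed

        pressed = thumb_angle > self.press_angle
        if pressed:
            if self.held_height is None:
                # Onset of thumb actuation: restore the oldest stored height.
                self.held_height = (
                    self.recent_heights[0] if self.recent_heights else h
                )
            h = self.held_height
        else:
            self.held_height = None
            self.recent_heights.append(h)

        return {
            "base": (self.scale * x, self.scale * z),
            "height": self.scale * h,
            "button": pressed,
        }
```

With alpha set to 1.0 the smoother is disabled and the mapper reduces to the simple scaling described in the text, plus the height hold. A longer lookback reaches further back before the thumb started to move, at the cost of a larger jump when the press begins.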

This  example  illustrates  an  important  advantage  of  hand  tracking  with  kinematic  models: 
absolute  3D  distances  (such  as  finger  height  above  a  table)  can  be  measured  from  a  single 
camera  image.  The  ability  to  recover  3D  spatial  quantities  from  hand  motion  is  one  of  the 
advantages our system has over approaches based on gesture recognition.
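
To make the single-camera claim concrete: once joint angles have been estimated, a kinematic model with known link lengths yields metric quantities by forward kinematics. The sketch below assumes a planar three-link finger whose joint angles are measured relative to the previous link, with the first angle measured from the table plane. The function name, the angle convention, and the link lengths in the usage note are assumptions for illustration and are not taken from the DigitEyes hand model.

```python
import math


def fingertip_height(link_lengths, joint_angles, base_height=0.0):
    """Height of the fingertip above the table for a planar finger chain.

    link_lengths: length of each link, in any consistent unit.
    joint_angles: angle of each joint in radians, relative to the previous
                  link (the first is relative to the table plane).
    base_height:  height of the first joint above the table.
    """
    height = base_height
    cumulative = 0.0
    for length, angle in zip(link_lengths, joint_angles):
        cumulative += angle                     # absolute elevation of this link
        height += length * math.sin(cumulative)
    return height
```

For example, with hypothetical link lengths of 0.05, 0.03, and 0.02 m and the finger pointing straight up (first angle of pi/2, the others zero), the function returns the full finger length, 0.10 m. The result is in absolute units because the link lengths are, which is the property the text relies on.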


Figure  7:  The  hand  model  used  in  the  3D  mouse  application  is  illustrated  for  frame  200  in 
the  motion  sequence  from  Fig.  9.  The  vertical  line  shows  the  height  of  the  tip  above  the 
ground  plane.  The  input  hand  image  (frame  200)  demonstrates  the  finger  motion  used  in 
extending  the  cursor  height. 

6.2  Whole  Hand  Tracking  With  Two  Cameras 

In the second tracking experiment, the DigitEyes system was used to track a full 27 DOF
hand model, using two camera image sequences. Because the hand motion must avoid
occlusions for successful tracking, the available range of travel is not large. It is sufficient,
however, to demonstrate recovery of articulated DOFs in conjunction with palm motion.
Figure 11 shows sample images, trackers, and features from both cameras at three points
along a 200 frame sequence. The two cameras were set up about a foot and a half apart
with optical centers verging near the middle of the tracking area, intersecting the table
surface at approximately 45 degrees. Fig. 12 shows the estimated model configurations
corresponding to the sample points. In the left column, the estimated model is rendered
from the viewpoint of the first camera. In the right column, it is shown from an arbitrary
viewpoint, demonstrating the 3D nature of our tracking result. A subset of the estimated
state trajectories for the motion sequence is given in Figs. 13 and 14.

Direct  measurement  of  tracker  accuracy  is  difficult  due  to  the  lack  of  ground  truth  data. 
We  plan  to  use  a  Polhemus  sensor  to  measure  the  accuracy  of  the  6  DOF  palm  state  estimate. 
Obtaining  ground  truth  measurements  for  joint  angles  is  much  more  difficult.  One  possible 
solution  is  to  wear  an  invasive  sensor,  like  the  DataGlove,  to  obtain  a  baseline  measurement. 
By fitting the DataGlove inside a larger unmarked glove, the effect of the external finger
sensors  on  the  feature  extraction  can  be  minimized. 


[Panels: Finger 1 States | Button States]

Figure 8: Palm rotation and finger joint angles for mouse pole hand model depicted in
Fig. 7. Joint angles for thumb and fourth finger, shown on right, are used as buttons. Note
the “button event” signalled by the thumb motion around frame 175.

Figure 9: Translation states for mouse pole hand model are given on the left. The Y axis
motion is constrained to zero due to the tabletop. On the right are the mouse pole states,
derived from the hand states through scaling and a coordinate change. The sequence of events
is: 0-150 finger raise/lower, 150-200 thumb actuation only, 200-350 base translation only,
350-500 combined 3 DOF motion.


7  Implementation  Details 

The  DigitEyes  system  is  built  around  a  special  board  for  real-time  image  processing,  called 
IC40. Each IC40 board contains a 68010 CPU, 5 MB of dual-ported RAM, a digitizer, and
a  video  generator.  The  key  feature  of  this  board  is  its  ability  to  deliver  digitized  images  to 
processor  memory  at  video  rate  with  no  computational  overhead.  This  removes  an  important 
bottleneck  in  most  workstation-based  tracking  systems.  Ordinary  C  code  can  be  compiled 
and  down-loaded  to  the  board  for  execution. 

In  the  multicamera  implementation,  there  is  an  IC40  board  for  each  camera.  The  total 
computation  is  divided  into  two  parts:  feature  extraction  and  state  estimation.  Feature 
extraction  is  done  in  parallel  by  each  board,  then  the  extracted  features  are  passed  over  the 
VME  bus  to  a  Sun  workstation,  which  combines  them  and  solves  the  resulting  least  squares 
problem  to  obtain  a  state  estimate.  Estimated  states  are  passed  over  the  Ethernet  to  a 
Silicon  Graphics  Indigo  2  workstation  for  model  rendering  and  display.  The  overall  system 
organization  is  shown  in  Fig.  15.  Our  experimental  testbed  for  hand  tracking  is  depicted  in 
Fig.  16. 
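
The division of labour described above (per-camera feature extraction, then one combined least squares solve) can be sketched as follows. This is a minimal sketch under stated assumptions, not the DigitEyes solver: it assumes each camera contributes a residual vector r and a Jacobian J of those residuals with respect to the state, and it takes a single linearized (Gauss-Newton style) step by solving the normal equations. The function names and the data layout are hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system: state not observable from features")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def combined_state_update(per_camera):
    """Stack (J, r) pairs from all cameras and return the state update dq.

    Minimizes ||J dq + r||^2 over dq, where J and r are the stacked
    Jacobians and residuals, by solving (J^T J) dq = -J^T r.
    """
    J = [row for jac, _ in per_camera for row in jac]
    r = [v for _, res in per_camera for v in res]
    n = len(J[0])
    JtJ = [[sum(row[i] * row[j] for row in J) for j in range(n)] for i in range(n)]
    Jtr = [sum(row[i] * ri for row, ri in zip(J, r)) for i in range(n)]
    return solve_linear(JtJ, [-v for v in Jtr])
```

In this toy layout a second camera matters in exactly the way the text suggests: if one view constrains only some state components, the stacked system is singular until another view supplies the missing constraints.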

The  generality  of  our  tracking  framework  is  reflected  in  the  software  organization  of  the 
DigitEyes  system.  Different  trackers  can  be  generated  simply  by  changing  the  kinematic 
description of the mechanism. Feature tracking code for the IC40 boards is generated
automatically from the kinematic description. This makes it possible to experiment with a
variety  of  kinematic  models,  tailored  to  specific  hand  tracking  applications. 
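
As a rough illustration of what "changing the kinematic description" could look like, the sketch below describes the partial hand model of the 3D mouse experiment as data and derives the state dimension and a list of feature trackers from it. The DOF counts (palm 3, first finger 3, fourth finger 1, thumb 1) come from the text; everything else is hypothetical, including the Chain type, the link counts, and the rule of one line tracker per link plus one point tracker per fingertip. It is not the format DigitEyes actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chain:
    name: str   # e.g. "palm", "thumb"
    dofs: int   # number of state variables this chain contributes
    links: int  # number of tracked rigid links (assumed, for illustration)


def state_dimension(chains):
    """Total number of state variables in the mechanism."""
    return sum(c.dofs for c in chains)


def feature_trackers(chains):
    """Assumed rule: a line tracker per link, a point tracker per chain tip."""
    trackers = []
    for c in chains:
        trackers += [("line", c.name, i) for i in range(c.links)]
        if c.links > 0:
            trackers.append(("point", c.name, "tip"))
    return trackers


# Partial hand model from the 3D mouse experiment (link counts are hypothetical).
MOUSE_HAND = [
    Chain("palm", dofs=3, links=0),
    Chain("first_finger", dofs=3, links=3),
    Chain("fourth_finger", dofs=1, links=1),
    Chain("thumb", dofs=1, links=1),
]
```

Swapping in a different list of chains changes both the state dimension and the generated tracker list without touching the functions, which is the kind of flexibility the paragraph describes.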

8  Conclusion 

We  have  presented  a  visual  tracking  framework  for  high  DOF  articulated  mechanisms,  and 
its  implementation  in  a  tracking  system  called  DigitEyes.  We  have  demonstrated  real-time 
hand  tracking  of  a  27  DOF  hand  model  using  two  cameras.  We  will  extend  this  basic 
work  in  two  ways.  First,  we  will  modify  our  feature  extraction  process  to  handle  occlusions 
and  complicated  backgrounds.  Second,  we  will  analyze  the  observability  requirements  of 
articulated  object  tracking  and  address  the  question  of  camera  placement. 
9 Appendix A: Spatial Transform Jacobian

Given  the  camera  and  hand  position  in  world  coordinates,  we  outline  the  derivation  of  the 
Jacobian for a point expressed in the camera frame under rotation and translation of the
palm.  We  start  with  the  basic  result 

$$ \dot{p}_c = R\,v + R\,(\omega \times p_t), $$

where $v$, $\omega$ give the velocity of the base frame in world coordinates. Eqn 9 follows
immediately. Substituting the additional relation

$$ \omega = J_\omega \dot{q}, $$

where $q$ is the quaternion parameterization of rotation and $J_\omega$ is a four by three Jacobian
matrix, and differentiating with respect to $q$, yields Eqn 10.

To obtain Eqn 12, we start with the relation

$$ \dot{R}(q) = S(\omega)\,R(q), $$

and solve it for $S(\omega)$, a skew symmetric matrix in the angular velocity. The other side is
then a matrix of linear equations in the $\dot{q}_i$. Eqn 12 results from equating the individual
components of $\omega$ with their linear representations in $\dot{q}$.
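
The relation between the rotation matrix derivative and the quaternion rates can be checked numerically. The sketch below is not taken from the report: it assumes a unit quaternion ordered (w, x, y, z), the standard rotation matrix for that convention, and a Jacobian written as three rows by four columns acting on the quaternion rate. The report's own layout and component ordering may differ. It compares the angular velocity obtained from the linear form with the one read off the skew symmetric matrix formed from a finite-difference derivative of R times the transpose of R.

```python
def rotation_matrix(q):
    """Rotation matrix of a unit quaternion q = (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def qdot_from_omega(q, omega):
    """Quaternion rate for world-frame angular velocity: 0.5 * (0, omega) * q."""
    w, x, y, z = q
    ox, oy, oz = omega
    return [
        -0.5 * (ox * x + oy * y + oz * z),
        0.5 * (w * ox + oy * z - oz * y),
        0.5 * (w * oy + oz * x - ox * z),
        0.5 * (w * oz + ox * y - oy * x),
    ]


def omega_from_qdot(q, qdot):
    """Angular velocity as a linear function of the quaternion rates."""
    w, x, y, z = q
    jac = [[-x, w, -z, y], [-y, z, w, -x], [-z, -y, x, w]]
    return [2 * sum(a * b for a, b in zip(row, qdot)) for row in jac]


def skew_from_rdot(q, qdot, h=1e-6):
    """S = Rdot * R^T, with Rdot from a central difference along qdot."""
    qp = [a + h * b for a, b in zip(q, qdot)]
    qm = [a - h * b for a, b in zip(q, qdot)]
    rp, rm, r = rotation_matrix(qp), rotation_matrix(qm), rotation_matrix(q)
    rdot = [[(rp[i][j] - rm[i][j]) / (2 * h) for j in range(3)] for i in range(3)]
    return [
        [sum(rdot[i][k] * r[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Reading the components of the angular velocity from the skew matrix as (S[2][1], S[0][2], S[1][0]) and comparing them with omega_from_qdot mirrors the "equating individual components" step described above.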

Figure 10: The mouse pole cursor at six positions during the motion sequence of Fig. 8.
The pole is the vertical line with a horizontal shadow, and is the only thing moving in the
sequence.  Samples  were  taken  at  frames  0,  d0,  7ri,  2W),  dOO,  and  d70  (chosen  to  illustrate 
the  range  of  motion). 


[Panels: Camera 0 View | Camera 1 View]

Figure 11: Three pairs of hand images from the continuous motion estimate plotted in
Figs. 13 and 14. Each stereo pair was obtained automatically during tracking by storing
every fiftieth image set to disk. The samples correspond to frames 49, 99, and 149.

[Panels: Palm Rotation | Palm Translation]

Figure 13: Estimated palm rotation and translation for motion sequence of entire hand.
Qw-Qz are the quaternion components of rotation, while Tx-Tz are the translation. The
sequence lasted 20 seconds.

Figure 14: Estimated joint angles for the first finger and thumb. The other three fingers are
similar to the first. Refer to Fig. 2 for variable definitions.

Figure 15: The hardware architecture for our current hand tracking system.


Figure 16: Experimental test bed for the DigitEyes system.


